This workflow reanalyze the adult single nucleus RNA-seq data produced by (Wang et al. 2018), including adult using single-nucleus droplet-based sequencing (snDrop-seq) from PFC (Lake et al. 2018).
In Page 7 of the supplementary materials of (Wang et al. 2018):
For the UMI count-based dataset, we integrated the PsychENCODE adult single-cell profiles for 17,093 cells from dorsolateral prefrontal cortex (DFC) with the published 10,319 adult single-cell data from PFC (20). The integrated UMI dataset includes nine excitatory, ten inhibitory and six non-neuronal cell types (astrocytes, endothelial cells, microglia, oligodendrocytes, OPCs, and pericytes), and a newly discovered excitatory neuronal type (Ex9).
We will first load a few libraries. Among them, * DropletUtils provides functions for data from droplet technologies such as 10X Genomics. * biomaRt provides easy access to databases, such as Ensembl, COSMIC, Uniprot, HGNC, etc. * scater is a collection of tools for doing quality control analyses of scRNA-seq * scran provide functions for normalization of cell-specific libary sizes, correcting batch effects, and identification marker genes
bio_packs = c("SingleCellExperiment","DropletUtils","biomaRt",
"scater","scran","limma","org.Hs.eg.db")
source("https://bioconductor.org/biocLite.R")
for(pack1 in bio_packs){
if( !pack1 %in% installed.packages()[,"Package"]){
biocLite(pack1, suppressUpdates = TRUE)
}
}
cran_packs = c("data.table","svd","Rtsne","stringi","irlba")
for(pack1 in cran_packs){
if( !pack1 %in% installed.packages()[,"Package"]){
install.packages(pack1)
}
}
library(data.table)
library(SingleCellExperiment)
library(DropletUtils)
library(biomaRt)
library(scater)
library(scran)
library(limma)
library(ggplot2)repo_dir = "~/scRNAseq_pipelines"
work_dir = file.path(repo_dir,"psychENCODE")
psychENCODE_dir = "~/psychENCODE_data"Next we import in the count data and other available information. The dataset is available here.
local_counts_fn = file.path(psychENCODE_dir, "DER-22_Single_cell_expression_raw_UMI.tsv")
online_counts_fn = "http://resource.psychencode.org/Datasets/Derived/SC_Decomp/DER-22_Single_cell_expression_raw_UMI.tsv"
if (file.exists(local_counts_fn)) {
counts = fread(local_counts_fn, data.table = FALSE)
} else {
counts = fread(online_counts_fn, data.table = FALSE)
}## Warning in fread(local_counts_fn, data.table = FALSE): Detected 27412
## column names but the data has 27413 columns (i.e. invalid file). Added 1
## extra default column name for the first column which is guessed to be row
## names or an index. Use setnames() afterwards if this guess is not correct,
## or fix the file write command that created the file to create a valid file.
dim(counts); counts[1:3,1:2]## [1] 17176 27413
## V1 Ex3e
## 1 A1BG 0
## 2 A1BG-AS1 0
## 3 A1CF 0
rownames(counts) = counts$V1
counts = as.matrix(counts[,-1])
colnames(counts)[1:10]## [1] "Ex3e" "Ex2" "In1b" "Oligo" "Ex1" "Astro" "Ex8"
## [8] "Astro.1" "Astro.2" "In4b"
# Parse cell ID. Since we used data.table() to read the table as data.table, it has automatically parsed the cell types.
# We need to parse it back as:
# Ex3e.2 -> Ex3e, V2898 -> NA
part_cell_id = sapply(colnames(counts),
function(xx) strsplit(xx,".",fixed=TRUE)[[1]][1],
USE.NAMES=FALSE)
part_cell_id[grep("^V", part_cell_id)] = NA
table(part_cell_id, useNA = "ifany")## part_cell_id
## Astro Endo Ex1 Ex2 Ex3e Ex4 Ex5b
## 2815 549 5059 1428 660 1571 2665
## Ex6a Ex6b Ex8 Ex9 In1a In1b In1c
## 319 1426 320 261 185 348 667
## In3 In4a In4b In6a In6b In7 In8
## 629 249 614 272 1432 305 1344
## Microglia Oligo OPC Per <NA>
## 317 2657 1243 45 32
col_dat = data.frame(sample_name = colnames(counts),
part_cell_id = part_cell_id,
stringsAsFactors = FALSE)
col_dat[1:5,]## sample_name part_cell_id
## 1 Ex3e Ex3e
## 2 Ex2 Ex2
## 3 In1b In1b
## 4 Oligo Oligo
## 5 Ex1 Ex1
We create sce object. We also check for spike-ins.
sce = SingleCellExperiment(assays = list(counts = counts),colData = col_dat)
rm(counts) # save some memory
sce## class: SingleCellExperiment
## dim: 17176 27412
## metadata(0):
## assays(1): counts
## rownames(17176): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(27412): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(2): sample_name part_cell_id
## reducedDimNames(0):
## spikeNames(0):
rownames(sce)[grep("^ERCC",rownames(sce))]## [1] "ERCC1" "ERCC2" "ERCC3" "ERCC4" "ERCC6" "ERCC6L" "ERCC6L2"
## [8] "ERCC8"
We will extract annotation information based on gene names.
anno_file = file.path(work_dir,"gene.annotation.rds")
if( file.exists(anno_file) ){
gene_anno = readRDS(anno_file)
} else{
ensembl = useEnsembl(biomart="ensembl",
dataset="hsapiens_gene_ensembl")
attr_string = c("hgnc_symbol","ensembl_gene_id","external_gene_name",
"chromosome_name", "start_position","end_position","strand",
"description","percentage_gene_gc_content","gene_biotype")
gene_anno = getBM(attributes = attr_string,
filters = "external_gene_name",
values = rownames(sce),
mart = ensembl)
saveRDS(gene_anno, file=anno_file)
}
dim(sce); dim(gene_anno)## [1] 17176 27412
## [1] 18551 10
We remove those genes that are part of extracted annotation, but aren’t in sce.
w2rm = which(!gene_anno$external_gene_name %in% rownames(sce))
w2rm## integer(0)
gene_anno[w2rm,]## [1] hgnc_symbol ensembl_gene_id
## [3] external_gene_name chromosome_name
## [5] start_position end_position
## [7] strand description
## [9] percentage_gene_gc_content gene_biotype
## <0 rows> (or 0-length row.names)
if (length(w2rm) != 0) gene_anno = gene_anno[-w2rm,]
dim(sce); dim(gene_anno)## [1] 17176 27412
## [1] 18551 10
Many genes have multiple entries in the annotation, often because they are annotated to scaffolds, assembly patches and alternate loci. Here we simply remove such entries. The we remove duplicated annotations and genes without annotations.
table(gene_anno$chromosome_name)[1:30]##
## 1 10 11 12
## 1651 697 907 853
## 13 14 15 16
## 375 500 551 699
## 17 18 19 2
## 938 257 1114 1222
## 20 21 22 3
## 469 187 400 1041
## 4 5 6 7
## 668 779 856 845
## 8 9 CHR_HG1_PATCH CHR_HG107_PATCH
## 635 665 20 1
## CHR_HG126_PATCH CHR_HG1311_PATCH CHR_HG1362_PATCH CHR_HG1815_PATCH
## 1 2 6 8
## CHR_HG1832_PATCH CHR_HG2002_PATCH
## 5 6
chr_nms = c(1:22,"X","Y","MT")
gene_anno = gene_anno[which(gene_anno$chromosome_name %in% chr_nms),]
dim(sce); dim(gene_anno)## [1] 17176 27412
## [1] 16971 10
t1 = table(gene_anno$external_gene_name)
t2 = sort(t1[t1 > 1], decreasing=TRUE)
length(t2)## [1] 15
t2[1:10]##
## ABCF2 ATXN7 CCDC39 DIABLO DNAJC9-AS1 GOLGA8M
## 2 2 2 2 2 2
## HSPA14 KBTBD11-OT1 LINC01115 LINC01505
## 2 2 2 2
gene_anno[which(gene_anno$external_gene_name %in% names(t2)[1:4]), 1:4]## hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
## 124 ABCF2 ENSG00000033050 ABCF2 7
## 125 ABCF2 ENSG00000285292 ABCF2 7
## 1792 ATXN7 ENSG00000163635 ATXN7 3
## 1793 ATXN7 ENSG00000285258 ATXN7 3
## 2380 CCDC39 ENSG00000145075 CCDC39 3
## 2381 ENSG00000284862 CCDC39 3
## 4041 DIABLO ENSG00000184047 DIABLO 12
## 4042 DIABLO ENSG00000284934 DIABLO 12
w_duplicate = which(gene_anno$external_gene_name %in% names(t2))
ganno2 = gene_anno[w_duplicate,]
dim(ganno2)## [1] 30 10
table(ganno2$hgnc_symbol == ganno2$external_gene_name)##
## FALSE TRUE
## 10 20
ganno2 = ganno2[which(ganno2$hgnc_symbol == ganno2$external_gene_name),]
dim(ganno2)## [1] 20 10
ganno2 = dplyr::distinct(ganno2,external_gene_name,.keep_all = TRUE)
dim(ganno2)## [1] 15 10
gene_anno = rbind(gene_anno[-w_duplicate,], ganno2)
dim(gene_anno)## [1] 16956 10
table(gene_anno$gene_biotype)##
## 3prime_overlapping_ncRNA antisense
## 4 708
## bidirectional_promoter_lncRNA lincRNA
## 12 641
## macro_lncRNA miRNA
## 1 1
## misc_RNA polymorphic_pseudogene
## 5 7
## processed_pseudogene processed_transcript
## 327 131
## protein_coding scRNA
## 14600 1
## sense_overlapping TEC
## 11 4
## TR_C_gene transcribed_processed_pseudogene
## 3 79
## transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene
## 56 245
## unitary_pseudogene unprocessed_pseudogene
## 7 113
gene_missing = setdiff(rownames(sce),gene_anno$external_gene_name)
gene_missing[1:10]## [1] "AAED1" "AATK-AS1" "AC004448.5" "AC004924.1" "AC005154.6"
## [6] "AC005550.3" "AC007365.3" "AC007386.2" "AC007966.1" "AC008074.3"
length(gene_missing)## [1] 220
w2kp = match(gene_anno$external_gene_name,rownames(sce))
table(is.na(w2kp))##
## FALSE
## 16956
gene_anno$external_gene_name[which(is.na(w2kp))]## character(0)
sce = sce[w2kp,]
dim(sce)## [1] 16956 27412
rowData(sce) = gene_anno
sce## class: SingleCellExperiment
## dim: 16956 27412
## metadata(0):
## assays(1): counts
## rownames(16956): ABCA7 AC009134.1 ... TMSB15B ZNF883
## rowData names(10): hgnc_symbol ensembl_gene_id ...
## percentage_gene_gc_content gene_biotype
## colnames(27412): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(2): sample_name part_cell_id
## reducedDimNames(0):
## spikeNames(0):
rowData(sce)[1:5,]## DataFrame with 5 rows and 10 columns
## hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
## <character> <character> <character> <character>
## ABCA7 ABCA7 ENSG00000064687 ABCA7 19
## AC009134.1 ENSG00000262116 AC009134.1 16
## AAMDC AAMDC ENSG00000087884 AAMDC 11
## AC116050.1 ENSG00000278131 AC116050.1 2
## AC144836.1 ENSG00000262213 AC144836.1 17
## start_position end_position strand
## <integer> <integer> <integer>
## ABCA7 1040101 1065572 1
## AC009134.1 13197606 13204907 -1
## AAMDC 77821109 77918432 1
## AC116050.1 91589494 91621421 1
## AC144836.1 1241939 1254000 -1
## description
## <character>
## ABCA7 ATP binding cassette subfamily A member 7 [Source:HGNC Symbol;Acc:HGNC:37]
## AC009134.1 novel transcript, antisense to SHISA9
## AAMDC adipogenesis associated Mth938 domain containing [Source:HGNC Symbol;Acc:HGNC:30205]
## AC116050.1 otopetrin 1 (OTOP1) pseudogene
## AC144836.1 novel transcript
## percentage_gene_gc_content gene_biotype
## <numeric> <character>
## ABCA7 60.49 protein_coding
## AC009134.1 42.37 antisense
## AAMDC 41.82 protein_coding
## AC116050.1 45.86 unprocessed_pseudogene
## AC144836.1 49.61 lincRNA
Please refer to this workflow in bioconductor for reference.
bcrank = barcodeRanks(counts(sce))
# Only show unique points for plotting speed.
uniq = !duplicated(bcrank$rank)
par(mar=c(5,4,2,1), bty="n")
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex=0.5, cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2,lwd=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2,lwd=2)
legend("left", legend=c("Inflection", "Knee"), bty="n",
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2,lwd=2)bcrank$inflection## Per.9
## 330
bcrank$knee## Oligo.924
## 1615
summary(bcrank$total)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 324 1101 2152 3158 4371 17392
table(bcrank$total >= bcrank$knee)##
## FALSE TRUE
## 10240 17172
table(bcrank$total >= bcrank$inflection)##
## FALSE TRUE
## 4 27408
set.seed(100)
date()## [1] "Sun Feb 24 17:57:52 2019"
e_out = emptyDrops(counts(sce))
date()## [1] "Sun Feb 24 17:58:16 2019"
e_out## DataFrame with 27412 rows and 5 columns
## Total LogProb PValue Limited FDR
## <integer> <numeric> <numeric> <logical> <numeric>
## Ex3e 9224 NaN 1 FALSE 0
## Ex2 5460 NaN 1 FALSE 0
## In1b 5436 NaN 1 FALSE 0
## Oligo 1508 NaN 1 FALSE 1
## Ex1 5715 NaN 1 FALSE 0
## ... ... ... ... ... ...
## Microglia.312 410 NaN 1 FALSE 1
## Microglia.313 484 NaN 1 FALSE 1
## Microglia.314 422 NaN 1 FALSE 1
## Microglia.315 416 NaN 1 FALSE 1
## Microglia.316 706 NaN 1 FALSE 1
length(unique(e_out$FDR))## [1] 2
table(e_out$FDR)##
## 0 1
## 17172 10240
tapply(e_out$Total, e_out$FDR, summary)## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1615 2347 3654 4525 5846 17392
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 324.0 472.0 731.0 865.7 1315.0 1614.0
From the above analysis, some cells with very small number of UMIs. Here we chose do not remove any cells.
We will generate a set of QC features per cell, including the expression of mitochondira genes (there is none here) or ribosomal genes. We identify ribosomal genes based on annotation from https://www.genenames.org/.
ribo_fn = file.path(work_dir,"ribosome_genes.txt")
ribo = read.table(ribo_fn,sep='\t',header=TRUE,stringsAsFactors= FALSE)
ribo[1:2,]## HGNC.ID Approved.Symbol Approved.Name Status Previous.Symbols
## 1 10298 RPL10 ribosomal protein L10 Approved
## 2 10299 RPL10A ribosomal protein L10a Approved NEDD6
## Synonyms Chromosome Accession.Numbers
## 1 NOV, QM, DXS648E, DXS648, FLJ23544, L10 Xq28 AB007170
## 2 Csa-19, L10A 6p21.31 U12404
## RefSeq.IDs Gene.Family.Tag Gene.family.description Gene.family.ID
## 1 NM_006013 RPL L ribosomal proteins 729
## 2 NM_007104 RPL L ribosomal proteins 729
is_mito = which(rowData(sce)$chromosome_name == "MT")
is_ribo = which(rowData(sce)$external_gene_name %in% ribo$Approved.Symbol)
length(is_mito)## [1] 0
length(is_ribo)## [1] 151
sce = calculateQCMetrics(sce, feature_controls=list(Ri=is_ribo))
sort(colnames(colData(sce)))## [1] "is_cell_control"
## [2] "log10_total_counts"
## [3] "log10_total_counts_endogenous"
## [4] "log10_total_counts_feature_control"
## [5] "log10_total_counts_Ri"
## [6] "log10_total_features_by_counts"
## [7] "log10_total_features_by_counts_endogenous"
## [8] "log10_total_features_by_counts_feature_control"
## [9] "log10_total_features_by_counts_Ri"
## [10] "part_cell_id"
## [11] "pct_counts_endogenous"
## [12] "pct_counts_feature_control"
## [13] "pct_counts_in_top_100_features"
## [14] "pct_counts_in_top_100_features_endogenous"
## [15] "pct_counts_in_top_100_features_feature_control"
## [16] "pct_counts_in_top_100_features_Ri"
## [17] "pct_counts_in_top_200_features"
## [18] "pct_counts_in_top_200_features_endogenous"
## [19] "pct_counts_in_top_200_features_feature_control"
## [20] "pct_counts_in_top_200_features_Ri"
## [21] "pct_counts_in_top_50_features"
## [22] "pct_counts_in_top_50_features_endogenous"
## [23] "pct_counts_in_top_50_features_feature_control"
## [24] "pct_counts_in_top_50_features_Ri"
## [25] "pct_counts_in_top_500_features"
## [26] "pct_counts_in_top_500_features_endogenous"
## [27] "pct_counts_in_top_500_features_feature_control"
## [28] "pct_counts_in_top_500_features_Ri"
## [29] "pct_counts_Ri"
## [30] "sample_name"
## [31] "total_counts"
## [32] "total_counts_endogenous"
## [33] "total_counts_feature_control"
## [34] "total_counts_Ri"
## [35] "total_features_by_counts"
## [36] "total_features_by_counts_endogenous"
## [37] "total_features_by_counts_feature_control"
## [38] "total_features_by_counts_Ri"
par(mfrow=c(1,3), mar=c(5, 4, 1, 1), bty="n")
hist(log10(sce$total_counts), xlab="log10(Library sizes)", main="",
breaks=20, col="grey80", ylab="Number of cells")
hist(sce$log10_total_features_by_counts, xlab="log10(# of expressed genes)",
main="", breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_Ri, xlab="Ribosome prop. (%)",
ylab="Number of cells", breaks=40, main="", col="grey80")# hist(sce$pct_counts_Mt, xlab="Mitochondrial prop. (%)",
# ylab="Number of cells", breaks=80, main="", col="grey80")
par(mfrow=c(1,3), mar=c(5, 4, 1, 1), bty="n")
smoothScatter(log10(sce$total_counts), sce$log10_total_features_by_counts,
xlab="log10(Library sizes)", ylab="log10(# of expressed genes)")
smoothScatter(log10(sce$total_counts), sce$pct_counts_Ri,
xlab="log10(Library sizes)", ylab="Ribosome prop. (%)")
# smoothScatter(log10(sce$total_counts), sce$pct_counts_Mt,
# xlab="log10(Library sizes)", ylab="Mitochondrial prop. (%)")
# smoothScatter(sce$pct_counts_Ri,sce$pct_counts_Mt,
# xlab="Ribosome prop. (%)", ylab="Mitochondrial prop. (%)")From the QC results, we will filter on the metrics by identifying outliers using isOutlier.
libsize_drop = isOutlier(sce$total_counts,nmads = 3,type = "lower",log = TRUE)
feature_drop = isOutlier(sce$total_features_by_counts,nmads = 3,
type = "lower",log = TRUE)
mito_drop = rep(FALSE, nrow(sce))
ribo_drop = isOutlier(sce$pct_counts_Ri,nmads = 3,type = "higher")
keep = !(libsize_drop | feature_drop | mito_drop | ribo_drop)## Warning in libsize_drop | feature_drop | mito_drop: longer object length is
## not a multiple of shorter object length
data.frame(ByLibSize = sum(libsize_drop),ByFeature = sum(feature_drop),
ByMito = sum(mito_drop),ByRibo = sum(ribo_drop),
Remaining = sum(keep))## ByLibSize ByFeature ByMito ByRibo Remaining
## 1 0 0 0 998 26414
table(colData(sce)$part_cell_id,keep)## keep
## FALSE TRUE
## Astro 14 2801
## Endo 410 139
## Ex1 24 5035
## Ex2 1 1427
## Ex3e 73 587
## Ex4 7 1564
## Ex5b 21 2644
## Ex6a 1 318
## Ex6b 8 1418
## Ex8 32 288
## Ex9 137 124
## In1a 5 180
## In1b 26 322
## In1c 55 612
## In3 34 595
## In4a 4 245
## In4b 3 611
## In6a 0 272
## In6b 7 1425
## In7 23 282
## In8 38 1306
## Microglia 17 300
## Oligo 48 2609
## OPC 3 1240
## Per 6 39
sce = sce[,keep]
dim(sce)## [1] 16956 26414
rowData(sce)[1:2,]## DataFrame with 2 rows and 18 columns
## hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
## <character> <character> <character> <character>
## ABCA7 ABCA7 ENSG00000064687 ABCA7 19
## AC009134.1 ENSG00000262116 AC009134.1 16
## start_position end_position strand
## <integer> <integer> <integer>
## ABCA7 1040101 1065572 1
## AC009134.1 13197606 13204907 -1
## description
## <character>
## ABCA7 ATP binding cassette subfamily A member 7 [Source:HGNC Symbol;Acc:HGNC:37]
## AC009134.1 novel transcript, antisense to SHISA9
## percentage_gene_gc_content gene_biotype is_feature_control
## <numeric> <character> <logical>
## ABCA7 60.49 protein_coding FALSE
## AC009134.1 42.37 antisense FALSE
## is_feature_control_Ri mean_counts log10_mean_counts
## <logical> <numeric> <numeric>
## ABCA7 FALSE 0.0698599153655333 0.0293269160342217
## AC009134.1 FALSE 0.00211586166642346 0.00091793627520486
## n_cells_by_counts pct_dropout_by_counts total_counts
## <integer> <numeric> <integer>
## ABCA7 1752 93.6086385524588 1915
## AC009134.1 58 99.7884138333577 58
## log10_total_counts
## <numeric>
## ABCA7 3.28239550474253
## AC009134.1 1.77085201164214
summary(rowData(sce)$mean_counts)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001715 0.017401 0.068984 0.186243 0.181089 20.989494
summary(rowData(sce)$mean_counts[rowData(sce)$mean_counts>0])## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001715 0.017401 0.068984 0.186243 0.181089 20.989494
summary(rowData(sce)$n_cells_by_counts)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47 453 1702 2919 4048 25645
par(mfrow=c(1,3), mar=c(5,4,1,1))
hist(log10(rowData(sce)$mean_counts+1e-6), col="grey80", main="",
breaks=40, xlab="log10(ave # of UMI + 1e-6)")
hist(log10(rowData(sce)$n_cells_by_counts+1), col="grey80", main="",
breaks=40, xlab="log10(# of expressed cells + 1)")
plot(log10(rowData(sce)$mean_counts+1e-6), pch=16, col=rgb(0,0,0,0.4),
log10(rowData(sce)$n_cells_by_counts + 1),
xlab="log10(ave # of UMI + 1e-6)",
ylab="log10(# of expressed cells + 1)")tb1 = table(rowData(sce)$n_cells_by_counts)
tb1[1:11]##
## 47 48 49 50 51 52 53 54 55 56 57
## 3 3 9 11 8 17 25 12 18 23 16
We remove those genes that are lowly expressed. To filter genes, we follow the threshold to remove genes with two or more UMIs in less than 10 nuclei.
Note that the variable strand need to be renamed, otherwise there is an error message that such a variabel name cannot be used.
names(rowData(sce))[names(rowData(sce)) == "strand"] = "strand_n"
n_genes = colSums(counts(sce) >= 2)
summary(n_genes)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 144.0 332.0 536.2 769.0 3170.0
table(n_genes >= 100)##
## FALSE TRUE
## 5312 21102
table(n_genes >= 200)##
## FALSE TRUE
## 8575 17839
n_genes = colSums(counts(sce) >= 1)
summary(n_genes)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 294.0 789.2 1478.5 1810.8 2612.0 6116.0
table(n_genes >= 100)##
## TRUE
## 26414
table(n_genes >= 200)##
## TRUE
## 26414
n_cells = rowSums(counts(sce) >= 2)
summary(n_cells)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 141.0 835.3 695.0 22946.0
table(n_cells >= 10)##
## FALSE TRUE
## 3567 13389
sce = sce[which(n_cells >= 10),]
dim(sce)## [1] 13389 26414
Next we check those highly expressed genes.
par(mar=c(5,4,1,1))
od1 = order(rowData(sce)$mean_counts, decreasing = TRUE)
barplot(rowData(sce)$mean_counts[od1[20:1]], las=1,
names.arg=rowData(sce)$hgnc_symbol[od1[20:1]],
horiz=TRUE, cex.names=0.8, cex.axis=0.8,
xlab="ave # of UMI")A simple solution for normalization and stablizing expression varaince across genes is to tranform the count data by log(count/size.factor + 1). One may calcualte size.factor per cell as the total number of UMIs, and this assumes the total expression are the same across all the cells. However, the total expression of each cell may vary with respect to cell type and/or cell size, and the computeSumFactors function in R package scran provides a more sophisicated way to calculate size.factor to allow such variaation across cells (Lun, Bach, and Marioni 2016). computeSumFactors can use initial clustering of cells to normalize expression within and beetween clusters. Within a cluster, it estimates the size factor for many groups of cells so that there are more groups than cells, and then it can calcualte the size factor per cell using a lienar deconvolution system. We remove all the cells with negative or very small size factors (< 0.01).
As shown in the following plot, the final size factor estimation is indeed highly correlated with the naive definition by total count.
Finally, the command normalize(sce) adds the normalized expression into the variable sce, which can be accessed by logcounts(sce) = log2(gene_cell_count / size_factor + 1).
date()## [1] "Sun Feb 24 17:58:33 2019"
clusters = quickCluster(sce, min.mean=0.1, method="igraph")
table(clusters)## clusters
## 1 2 3 4 5 6 7 8 9
## 596 1875 2300 5736 752 997 4715 1746 7697
date()## [1] "Sun Feb 24 17:59:31 2019"
sce = computeSumFactors(sce, cluster=clusters, min.mean=0.1)
date()## [1] "Sun Feb 24 18:03:32 2019"
summary(sizeFactors(sce))## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05174 0.33275 0.68518 1.00000 1.41138 5.97528
sort(sizeFactors(sce))[1:30]## [1] 0.05173631 0.05645421 0.06262374 0.07171504 0.07367298 0.07448167
## [7] 0.07763261 0.07768411 0.07838728 0.07939418 0.07966941 0.08160297
## [13] 0.08434265 0.08670765 0.08698396 0.08829519 0.08846775 0.08859333
## [19] 0.08864534 0.08887766 0.08928673 0.08934338 0.08937853 0.08988685
## [25] 0.08990635 0.08996416 0.09006206 0.09011824 0.09024464 0.09024836
dim(sce)## [1] 13389 26414
sce = sce[,which(sizeFactors(sce) > 0.01)]
dim(sce)## [1] 13389 26414
par(mfrow=c(1,2), mar=c(5,4,2,1), bty="n")
smoothScatter(sce$total_counts, sizeFactors(sce), log="xy",
xlab="total counts", ylab="size factors")
plot(sce$total_counts, sizeFactors(sce), log="xy",
xlab="total counts", ylab="size factors",
cex=0.3, pch=20, col=rgb(0.1,0.2,0.7,0.3))
abline(h=0.05)dim(sce)## [1] 13389 26414
sce = sce[,which(sizeFactors(sce) > 0.05)]
dim(sce)## [1] 13389 26414
sce = normalize(sce) For dimension reduction, such as calculating PCA or performing TSNE, we should start by identifying a subset of genes with high level of biological signal relative to background (technical) noise. These genes are referred to as HVGs (highly variable genes). The decomposeVar function from R/cran is designed for this task.
new_trend = makeTechTrend(x=sce)
fit = trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(fit$mean, fit$var, pch=20, col=rgb(0.1,0.2,0.7,0.6),
xlab="log(mean)", ylab="var")
curve(fit$trend(x), col="orange", lwd=2, add=TRUE)
curve(new_trend(x), col="red", lwd=2, add=TRUE)
legend("top", legend=c("Poisson noise", "observed trend"),
lty=1, lwd=2, col=c("red", "orange"), bty="n")fit$trend = new_trend
dec = decomposeVar(fit=fit)
top_dec = dec[order(dec$bio, decreasing=TRUE),]
plotExpression(sce, features=rownames(top_dec)[1:10])When performing PCA, we can use all the genes or just those genes with high signal-to-noise ratio (HGVs). TSNE analysis is usually based on the top PCs rather than the original gene expression data. We first perform PCA using all the genes and the function denoisePCA can automatically select the PCs based on modeling of technical noise.
date()## [1] "Sun Feb 24 18:04:22 2019"
sce = denoisePCA(sce, technical=new_trend, approx=TRUE)
date()## [1] "Sun Feb 24 18:05:37 2019"
dim(reducedDim(sce, "PCA"))## [1] 26414 24
plot(log10(attr(reducedDim(sce), "percentVar")), xlab="PC",
ylab="log10(Prop of variance explained)", pch=20, cex=0.6,
col=rgb(0.8, 0.2, 0.2, 0.5))
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")df_pcs = data.frame(reducedDim(sce, "PCA"))
df_pcs$log10_total_features_by_counts = colData(sce)$log10_total_features_by_counts
df_pcs$part_cell_id = colData(sce)$part_cell_id
gp1 = ggplot(df_pcs, aes(PC1,PC2,col=log10_total_features_by_counts)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1gp1 = ggplot(df_pcs, aes(PC1,PC2,col=part_cell_id)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
guides(color = guide_legend(override.aes = list(size=3)))
gp1date()## [1] "Sun Feb 24 18:05:38 2019"
sce = runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)## Warning: 'rand.seed=' is deprecated.
## Use 'set.seed' externally instead.
date()## [1] "Sun Feb 24 18:06:56 2019"
df_tsne = data.frame(reducedDim(sce, "TSNE"))
df_tsne$log10_total_features_by_counts = colData(sce)$log10_total_features_by_counts
df_tsne$part_cell_id = colData(sce)$part_cell_id
gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features_by_counts)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1gp1 = ggplot(df_tsne, aes(X1,X2,col=part_cell_id)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
guides(color = guide_legend(override.aes = list(size=3)))
gp1We select around top 2,500 genes (based on FDR and biological residual thresholds) as HVGs (highly variable genes).
summary(dec$bio)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.050635 -0.003706 0.001581 0.020546 0.013081 3.103550
dec1 = dec
dec1$bio[which(dec$bio < 1e-5)] = 1e-5
dec1$FDR[which(dec$FDR < 1e-100)] = 1e-100
par(mfrow=c(1,2))
hist(log10(dec1$bio),breaks=100,main="")
hist(log10(dec1$FDR),breaks=100,main="")summary(dec$FDR[dec$bio > 0.01])## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000000 0.0001085 0.0000000 0.0437114
summary(dec$FDR[dec$bio > 0.03])## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 1.266e-09 0.000e+00 2.109e-06
table(dec$FDR < 1e-10,dec$bio > 0.03)##
## FALSE TRUE
## FALSE 8327 23
## TRUE 3132 1907
table(dec$FDR < 1e-10,dec$bio > 0.01)##
## FALSE TRUE
## FALSE 7944 406
## TRUE 1539 3500
table(dec$FDR < 1e-10,dec$bio > 0.02)##
## FALSE TRUE
## FALSE 8256 94
## TRUE 2558 2481
w2kp = which(dec$FDR < 1e-10 & dec$bio > 0.02)
sce_hvg = sce[w2kp,]
sce_hvg## class: SingleCellExperiment
## dim: 2481 26414
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2481): ADAMTS17 ABCG1 ... ZRANB2 ZNF536
## rowData names(18): hgnc_symbol ensembl_gene_id ... total_counts
## log10_total_counts
## colnames(26414): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(38): sample_name part_cell_id ...
## pct_counts_in_top_200_features_Ri
## pct_counts_in_top_500_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
edat = t(as.matrix(logcounts(sce_hvg)))
edat = scale(edat)
dim(edat)## [1] 26414 2481
edat[1:2,1:3]## ADAMTS17 ABCG1 ADARB2
## Ex3e -0.3253376 -0.1980436 -0.4633208
## Ex2 -0.3253376 -0.1980436 -0.4633208
Perform PCA and use the top 50 PCs for TSNE projection. When calculating PCs, we use log-transformed normalized gene expression data: log2(norm_express+1).
library(svd)
library(Rtsne)
date()## [1] "Sun Feb 24 18:07:05 2019"
ppk = propack.svd(edat,neig=50)
date()## [1] "Sun Feb 24 18:07:16 2019"
pca = t(ppk$d*t(ppk$u)) # calculates pc scores aka principal components
tmp_df = data.frame(pca[,1:2])
names(tmp_df) = paste0("HVG_PC",seq(ncol(tmp_df)))
df_hvg = data.frame(colData(sce)[,c("sample_name","part_cell_id",
"log10_total_features_by_counts")],tmp_df)
rownames(df_hvg) = NULL
set.seed(100)
date()## [1] "Sun Feb 24 18:07:16 2019"
tsne = Rtsne(pca, pca = FALSE)
date()## [1] "Sun Feb 24 18:08:47 2019"
tmp_df = data.frame(tsne$Y)
names(tmp_df) = paste0("HVG_TSNE",seq(ncol(tmp_df)))
df_hvg = data.frame(df_hvg,tmp_df)
gp1 = ggplot(df_hvg, aes(HVG_TSNE1,HVG_TSNE2,col=log10_total_features_by_counts)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1gp1 = ggplot(df_hvg, aes(HVG_TSNE1,HVG_TSNE2,col=part_cell_id)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
guides(color = guide_legend(override.aes = list(size=3)))
gp1reducedDims(sce_hvg) = SimpleList(PCA=pca, TSNE=tsne$Y)
sce_hvg## class: SingleCellExperiment
## dim: 2481 26414
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2481): ADAMTS17 ABCG1 ... ZRANB2 ZNF536
## rowData names(18): hgnc_symbol ensembl_gene_id ... total_counts
## log10_total_counts
## colnames(26414): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(38): sample_name part_cell_id ...
## pct_counts_in_top_200_features_Ri
## pct_counts_in_top_500_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
Next we cluster the cells using a simple kmeans method on the top 50 PCs. We set the number of clusters to be 25, the number of clusters that they have found in (Wang et al. 2018).
date()## [1] "Sun Feb 24 18:08:49 2019"
k25_50_pcs = kmeans(reducedDim(sce_hvg, "PCA"), centers=25,
iter.max = 1e8, nstart = 250,
algorithm="MacQueen")
date()## [1] "Sun Feb 24 18:11:39 2019"
names(k25_50_pcs)## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
dim(k25_50_pcs$centers)## [1] 25 50
df_tsne$cluster_kmean = as.factor(k25_50_pcs$cluster)
gp1 = ggplot(df_tsne, aes(X1,X2,col=cluster_kmean)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
guides(color = guide_legend(override.aes = list(size=3)))
gp1Finally we save the sce object and the clustering results.
saveRDS(sce,file.path(psychENCODE_dir,"sce.rds"))
saveRDS(sce_hvg,file.path(psychENCODE_dir,"sce_hvg.rds"))
saveRDS(k25_50_pcs,file.path(psychENCODE_dir,"k25_50_pcs.rds"))sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.15 svd_0.4.1
## [3] limma_3.38.3 scran_1.10.2
## [5] scater_1.10.1 ggplot2_3.1.0
## [7] biomaRt_2.38.0 DropletUtils_1.2.2
## [9] SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0
## [11] DelayedArray_0.8.0 BiocParallel_1.16.6
## [13] matrixStats_0.54.0 Biobase_2.42.0
## [15] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
## [17] IRanges_2.16.0 S4Vectors_0.20.1
## [19] BiocGenerics_0.28.0 data.table_1.12.0
## [21] BiocInstaller_1.30.0 rmarkdown_1.11
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] progress_1.2.0 httr_1.4.0
## [5] dynamicTreeCut_1.63-1 tools_3.5.2
## [7] R6_2.3.0 irlba_2.3.3
## [9] KernSmooth_2.23-15 HDF5Array_1.10.1
## [11] vipor_0.4.5 DBI_1.0.0
## [13] lazyeval_0.2.1 colorspace_1.4-0
## [15] withr_2.1.2 tidyselect_0.2.5
## [17] gridExtra_2.3 prettyunits_1.0.2
## [19] bit_1.1-14 compiler_3.5.2
## [21] BiocNeighbors_1.0.0 labeling_0.3
## [23] scales_1.0.0 stringr_1.4.0
## [25] digest_0.6.18 XVector_0.22.0
## [27] pkgconfig_2.0.2 htmltools_0.3.6
## [29] rlang_0.3.1 RSQLite_2.1.1
## [31] DelayedMatrixStats_1.4.0 bindr_0.1.1
## [33] dplyr_0.7.8 RCurl_1.95-4.11
## [35] magrittr_1.5 GenomeInfoDbData_1.2.0
## [37] Matrix_1.2-15 Rcpp_1.0.0
## [39] ggbeeswarm_0.6.0 munsell_0.5.0
## [41] Rhdf5lib_1.4.2 viridis_0.5.1
## [43] stringi_1.2.4 yaml_2.2.0
## [45] edgeR_3.24.3 zlibbioc_1.28.0
## [47] rhdf5_2.26.2 plyr_1.8.4
## [49] grid_3.5.2 blob_1.1.1
## [51] crayon_1.3.4 lattice_0.20-38
## [53] cowplot_0.9.4 hms_0.4.2
## [55] locfit_1.5-9.1 knitr_1.21
## [57] pillar_1.3.1 igraph_1.2.3
## [59] reshape2_1.4.3 XML_3.98-1.17
## [61] glue_1.3.0 evaluate_0.13
## [63] gtable_0.2.0 purrr_0.3.0
## [65] assertthat_0.2.0 xfun_0.4
## [67] viridisLite_0.3.0 tibble_2.0.1
## [69] AnnotationDbi_1.44.0 beeswarm_0.2.3
## [71] memoise_1.1.0 bindrcpp_0.2.2
## [73] statmod_1.4.30
Lake, Blue B., Song Chen, Brandon C. Sos, Jean Fan, Gwendolyn E. Kaeser, Yun C. Yung, Thu E. Duong, et al. 2018. “Integrative Single-Cell Analysis of Transcriptional and Epigenetic States in the Human Adult Brain.” Nature Biotechnology 36 (1): 70–80. doi:10.1038/nbt.4038.
Lun, Aaron T. L., Karsten Bach, and John C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biology 17 (1): 75. doi:10.1186/s13059-016-0947-7.
Wang, Daifeng, Shuang Liu, Jonathan Warrell, Hyejung Won, Xu Shi, Fabio C. P. Navarro, Declan Clarke, et al. 2018. “Comprehensive Functional Genomic Resource and Integrative Model for the Human Brain.” Science 362 (6420): eaat8464. doi:10.1126/science.aat8464.